ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression
Author information
SJTU, Minyi Guo's lab; group of junior advisor Jieru Zhao (Jieru Zhao's Homepage)
Link:
[2412.03213] ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression
Abstract
Large Language Models (LLMs) have been widely deployed in a variety of applications, and the context length is rapidly increasing to handle tasks such as long-document QA and complex logical reasoning. However, long context poses significant challenges for inference efficiency, including high memory costs of key-value (KV) cache and increased latency due to extensive memory accesses. Recent works have proposed compressing KV cache to approximate computation, but these methods either evict tokens permanently, never recalling them for later inference, or recall previous tokens at the granularity of pages divided by textual positions. Both approaches degrade the model accuracy and output quality. To achieve efficient and accurate recallable KV cache compression, we introduce ClusterKV, which recalls tokens at the granularity of semantic clusters. [Note: recall is done per semantic cluster.] We design and implement efficient algorithms and systems for clustering, selection, indexing and caching. Experiment results show that ClusterKV attains negligible accuracy loss across various tasks with 32k context lengths, using only a 1k to 2k KV cache budget, and achieves up to a 2× speedup in latency and a 2.5× improvement in decoding throughput. Compared to SoTA recallable KV compression methods, ClusterKV demonstrates higher model accuracy and output quality, while maintaining or exceeding inference efficiency.
One-sentence summary
Sparsify attention by selecting KV entries at the granularity of semantic clusters.
Motivation
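Sparsity of attention computation: only a subset of the KV entries needs to take part in the attention computation with Q.
In semantic space, tokens whose keys are close to each other tend to receive similar attention weights for a given query.
- Using cosine similarity as the vector distance captures the semantic space better.
- Attention sinks: the initial tokens consistently attract high attention weights.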
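To make the cosine-similarity point concrete, here is a minimal sketch in plain Python. The three key vectors are hypothetical toy values, not data from the paper. It shows a case where Euclidean distance and cosine similarity disagree about which key is "close" to `k1`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between vectors a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy keys (assumption, for illustration only):
# k2 points in the same direction as k1 but is 5x longer;
# k3 has the same norm as k1 but is orthogonal to it.
k1 = [1.0, 2.0, 0.0]
k2 = [5.0, 10.0, 0.0]
k3 = [2.0, -1.0, 0.0]

sim_12 = cosine_similarity(k1, k2)   # same direction
sim_13 = cosine_similarity(k1, k3)   # orthogonal
dist_12 = euclidean_distance(k1, k2)
dist_13 = euclidean_distance(k1, k3)

print(f"cosine:    k1~k2 = {sim_12:.3f}, k1~k3 = {sim_13:.3f}")
print(f"euclidean: k1~k2 = {dist_12:.3f}, k1~k3 = {dist_13:.3f}")
```

By Euclidean distance `k3` looks closer to `k1` than `k2` does, while cosine similarity groups `k1` with `k2`, which share a direction and differ only in norm.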
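Innovations and contributions
Cluster Attention: in semantic space, the closer a cluster of tokens (its keys k) is to the query q, the more it contributes to the attention computation.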
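The selection idea can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the function name `select_clusters`, the dot-product scoring of centroids, counting sink tokens toward the budget, and truncating the last cluster to fit are all choices made here for the example.

```python
def select_clusters(query, centroids, cluster_members, budget, sink_indices):
    """Return the token indices to attend to under a fixed token budget.

    Sink tokens are always kept. Clusters are then ranked by the dot product
    between the query and each centroid and added in rank order; the last
    cluster added is truncated so the budget is never exceeded.
    (Scoring rule and budget accounting are assumptions of this sketch.)
    """
    scores = [sum(q * c for q, c in zip(query, centroid)) for centroid in centroids]
    ranked = sorted(range(len(centroids)), key=lambda i: scores[i], reverse=True)

    selected = list(sink_indices)
    for cluster_id in ranked:
        if len(selected) >= budget:
            break
        remaining = budget - len(selected)
        selected.extend(cluster_members[cluster_id][:remaining])
    return sorted(selected)

# Hypothetical toy setup: 2 sink tokens and 3 clusters over tokens 2..9.
query = [1.0, 0.0]
centroids = [[0.9, 0.1], [-1.0, 0.0], [0.5, 0.5]]
cluster_members = [[2, 5, 7], [3, 4], [6, 8, 9]]

selected = select_clusters(query, centroids, cluster_members,
                           budget=6, sink_indices=[0, 1])
print(selected)
```

Only the centroids are compared against the query, so the cost of selection grows with the number of clusters rather than with the context length.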
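Design details
- Run K-means on the key vectors, keeping 16 attention-sink tokens aside as outliers. The paper stresses here that clustering is applied only to generated tokens, not to prefill tokens.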
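A minimal sketch of this clustering step, in plain Python, is given below. It is a toy spherical K-means, assuming cosine similarity as the metric (as noted under Motivation) and treating the first `num_sinks` tokens as the sinks. The function name `cluster_keys`, the random initialization, and the fixed iteration count are assumptions of this example, not details confirmed by the paper.

```python
import math
import random

def normalize(v):
    """Scale v to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cluster_keys(keys, num_clusters, num_sinks=16, iterations=10, seed=0):
    """Cluster key vectors by cosine similarity, skipping the first num_sinks tokens.

    Returns (sink_indices, centroids, assignments), where assignments maps each
    non-sink token index to a cluster id. Requires at least num_clusters
    non-sink tokens, otherwise random.sample raises ValueError.
    """
    sink_indices = list(range(min(num_sinks, len(keys))))
    token_indices = list(range(len(sink_indices), len(keys)))
    unit_keys = {i: normalize(keys[i]) for i in token_indices}

    # Assumption: centroids are initialized from randomly sampled keys.
    rng = random.Random(seed)
    centroids = [unit_keys[i] for i in rng.sample(token_indices, num_clusters)]

    assignments = {}
    for _ in range(iterations):
        # Assignment step: most similar centroid (dot product of unit vectors).
        for i in token_indices:
            sims = [sum(a * b for a, b in zip(unit_keys[i], c)) for c in centroids]
            assignments[i] = max(range(num_clusters), key=lambda c: sims[c])
        # Update step: re-normalized mean of members; empty clusters are left as-is.
        for c in range(num_clusters):
            members = [unit_keys[i] for i in token_indices if assignments[i] == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return sink_indices, centroids, assignments

# Hypothetical toy keys: 2 sinks, then two groups pointing along x and along y.
keys = [[1.0, 1.0], [1.0, -1.0],
        [1.0, 0.1], [2.0, 0.0], [0.9, -0.1],
        [0.1, 1.0], [0.0, 3.0], [-0.1, 0.8]]
sinks, centroids, assignments = cluster_keys(keys, num_clusters=2, num_sinks=2)
print(sinks, assignments)
```

In the toy data, keys of very different norms (for example `[1.0, 0.1]` and `[2.0, 0.0]`) land in the same cluster because only direction matters after normalization.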